The German credit risk data can be downloaded from the UCI Machine Learning Repository. The dataset has 1000 observations of 21 variables, both categorical and numeric.
The preprocessing steps include converting the data to the required data types, using interpretable class labels, and checking for and omitting NAs in the data (if any).
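The three preprocessing steps above can be sketched as follows. This is a minimal illustration in Python, not the code used to produce this analysis. The rows in `RAW` are toy records reduced to four fields, the `?` entry is a made-up missing value, and the helper names `to_int` and `preprocess` are hypothetical. The code-to-label maps assume the raw UCI file's coding (`A11` to `A14` for the checking account status, `1`/`2` for good/bad credit risk) and reuse the label names shown in this article.

```python
# Toy records in a space-separated layout (hypothetical, reduced to four fields:
# checking status, duration in months, credit amount, class).
RAW = """\
A11 6 1169 1
A12 48 5951 2
A14 12 2096 1
A13 ? 1500 1
"""

# Assumed code-to-label maps; label names follow the tables in this article.
CHECKING = {"A11": "lt_0", "A12": "lt_200", "A13": "gte_200", "A14": "No_account"}
RISK = {"1": "Good", "2": "Bad"}


def to_int(text):
    """Return int(text), or None when the field is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def preprocess(raw):
    """Convert types, apply interpretable labels, and omit incomplete records."""
    rows = []
    for line in raw.splitlines():
        status, duration, amount, risk = line.split()
        row = {
            "Status_checking_account": CHECKING.get(status),
            "Duration_in_month": to_int(duration),
            "Credit_amount": to_int(amount),
            "Credit_Risk": RISK.get(risk),
        }
        # Keep the record only if no field is missing (None).
        if all(value is not None for value in row.values()):
            rows.append(row)
    return rows


clean = preprocess(RAW)
for row in clean:
    print(row)
```

The record with the unparseable duration is dropped, and the remaining three keep integer durations and amounts alongside readable labels.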
Let’s look at the data! The 21 variables are listed below.
| Variables in the data |
|---|
| Status_checking_account |
| Duration_in_month |
| Credit_history |
| Purpose |
| Credit_amount |
| Savings_account_bonds |
| Present_employment_since |
| Installment_rate_in_percentage_of_disp_income |
| Personal_status_and_sex |
| Guarantors |
| Present_residence_since |
| Property |
| Age |
| Other_installment_plans |
| Housing |
| Number_of_existing_credits_at_this_bank |
| Job |
| Number_of_dependants |
| Telephone |
| foreign_worker |
| Credit_Risk |
Here is the data summary for the numeric variables:
| Statistic | Duration_in_month | Credit_amount | Age |
|---|---|---|---|
| Min. | 4.0 | 250 | 19.00 |
| 1st Qu. | 12.0 | 1366 | 27.00 |
| Median | 18.0 | 2320 | 33.00 |
| Mean | 20.9 | 3271 | 35.55 |
| 3rd Qu. | 24.0 | 3972 | 42.00 |
| Max. | 72.0 | 18424 | 75.00 |
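A six-number summary like the one above can be computed along these lines. This is a sketch in Python rather than the code behind the table. The `summarize` helper and the toy column are hypothetical, and the `"inclusive"` quantile method is assumed to correspond to the linear interpolation that R's `summary()` uses by default.

```python
import statistics


def summarize(values):
    """Min, quartiles, mean and max of a numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical toy column, not the real Duration_in_month values.
duration_in_month = [6, 12, 18, 24, 48]
summary = summarize(duration_in_month)
for name, value in summary.items():
    print(f"{name:<8} {value}")
```

When the mean sits well above the median, as it does for Credit_amount in the table, the column is right-skewed, which is worth keeping in mind when reading the boxplots.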
Credit_Risk is the outcome variable. The frequency table of each categorical variable against Credit_Risk is shown below. Each cell shows the count together with the per-row proportion. For example, for the foreign_worker variable, 30.7% of the foreign workers have the credit risk label “bad” and 69.3% have the label “good”.
Status_checking_account vs Credit_Risk. Cells show N with N / Row Total in parentheses; the Row Total column shows each row's count with its share of the 1000 observations.
| Status_checking_account | Bad | Good | Row Total |
|---|---|---|---|
| No_account | 46 (0.117) | 348 (0.883) | 394 (0.394) |
| lt_0 | 135 (0.493) | 139 (0.507) | 274 (0.274) |
| lt_200 | 105 (0.390) | 164 (0.610) | 269 (0.269) |
| gte_200 | 14 (0.222) | 49 (0.778) | 63 (0.063) |
| Column Total | 300 | 700 | 1000 |
Credit_history vs Credit_Risk, N (N / Row Total):
| Credit_history | Bad | Good | Row Total |
|---|---|---|---|
| Critical | 50 (0.171) | 243 (0.829) | 293 (0.293) |
| delayed_in_past | 28 (0.318) | 60 (0.682) | 88 (0.088) |
| No_credit_due | 25 (0.625) | 15 (0.375) | 40 (0.040) |
| All_paid_duly | 28 (0.571) | 21 (0.429) | 49 (0.049) |
| All_existing_paid_duly | 169 (0.319) | 361 (0.681) | 530 (0.530) |
| Column Total | 300 | 700 | 1000 |
Purpose vs Credit_Risk, N (N / Row Total):
| Purpose | Bad | Good | Row Total |
|---|---|---|---|
| Appliances | 4 (0.333) | 8 (0.667) | 12 (0.012) |
| Business | 34 (0.351) | 63 (0.649) | 97 (0.097) |
| Education | 22 (0.440) | 28 (0.560) | 50 (0.050) |
| Furniture | 58 (0.320) | 123 (0.680) | 181 (0.181) |
| New.car | 89 (0.380) | 145 (0.620) | 234 (0.234) |
| Others | 5 (0.417) | 7 (0.583) | 12 (0.012) |
| Repairs | 8 (0.364) | 14 (0.636) | 22 (0.022) |
| Retraining | 1 (0.111) | 8 (0.889) | 9 (0.009) |
| Television | 62 (0.221) | 218 (0.779) | 280 (0.280) |
| Used.car | 17 (0.165) | 86 (0.835) | 103 (0.103) |
| Column Total | 300 | 700 | 1000 |
Savings_account_bonds vs Credit_Risk, N (N / Row Total):
| Savings_account_bonds | Bad | Good | Row Total |
|---|---|---|---|
| No_savings | 32 (0.175) | 151 (0.825) | 183 (0.183) |
| lt_100 | 217 (0.360) | 386 (0.640) | 603 (0.603) |
| 100_500 | 34 (0.330) | 69 (0.670) | 103 (0.103) |
| 500_1000 | 11 (0.175) | 52 (0.825) | 63 (0.063) |
| gt_1000 | 6 (0.125) | 42 (0.875) | 48 (0.048) |
| Column Total | 300 | 700 | 1000 |
Present_employment_since vs Credit_Risk, N (N / Row Total):
| Present_employment_since | Bad | Good | Row Total |
|---|---|---|---|
| Unemployed | 23 (0.371) | 39 (0.629) | 62 (0.062) |
| 1_yr | 70 (0.407) | 102 (0.593) | 172 (0.172) |
| 4_yr | 104 (0.307) | 235 (0.693) | 339 (0.339) |
| 7_yr | 39 (0.224) | 135 (0.776) | 174 (0.174) |
| gt_7_yr | 64 (0.253) | 189 (0.747) | 253 (0.253) |
| Column Total | 300 | 700 | 1000 |
Installment_rate_in_percentage_of_disp_income vs Credit_Risk, N (N / Row Total):
| Installment_rate_in_percentage_of_disp_income | Bad | Good | Row Total |
|---|---|---|---|
| 0_20 | 159 (0.334) | 317 (0.666) | 476 (0.476) |
| 20_25 | 45 (0.287) | 112 (0.713) | 157 (0.157) |
| 25_35 | 62 (0.268) | 169 (0.732) | 231 (0.231) |
| 35_plus | 34 (0.250) | 102 (0.750) | 136 (0.136) |
| Column Total | 300 | 700 | 1000 |
Personal_status_and_sex vs Credit_Risk, N (N / Row Total):
| Personal_status_and_sex | Bad | Good | Row Total |
|---|---|---|---|
| Male.divorced | 20 (0.400) | 30 (0.600) | 50 (0.050) |
| Female.divorced | 109 (0.352) | 201 (0.648) | 310 (0.310) |
| male.single | 146 (0.266) | 402 (0.734) | 548 (0.548) |
| male.married | 25 (0.272) | 67 (0.728) | 92 (0.092) |
| Column Total | 300 | 700 | 1000 |
Guarantors vs Credit_Risk, N (N / Row Total):
| Guarantors | Bad | Good | Row Total |
|---|---|---|---|
| none | 272 (0.300) | 635 (0.700) | 907 (0.907) |
| co_applicant | 18 (0.439) | 23 (0.561) | 41 (0.041) |
| guarantor | 10 (0.192) | 42 (0.808) | 52 (0.052) |
| Column Total | 300 | 700 | 1000 |
Present_residence_since vs Credit_Risk, N (N / Row Total):
| Present_residence_since | Bad | Good | Row Total |
|---|---|---|---|
| lt_1_yr | 36 (0.277) | 94 (0.723) | 130 (0.130) |
| 1_4yr | 97 (0.315) | 211 (0.685) | 308 (0.308) |
| 4_7yr | 43 (0.289) | 106 (0.711) | 149 (0.149) |
| gt_7_yr | 124 (0.300) | 289 (0.700) | 413 (0.413) |
| Column Total | 300 | 700 | 1000 |
Property vs Credit_Risk, N (N / Row Total):
| Property | Bad | Good | Row Total |
|---|---|---|---|
| No.property | 67 (0.435) | 87 (0.565) | 154 (0.154) |
| Real.estate | 60 (0.213) | 222 (0.787) | 282 (0.282) |
| insurance | 71 (0.306) | 161 (0.694) | 232 (0.232) |
| car | 102 (0.307) | 230 (0.693) | 332 (0.332) |
| Column Total | 300 | 700 | 1000 |
Other_installment_plans vs Credit_Risk, N (N / Row Total):
| Other_installment_plans | Bad | Good | Row Total |
|---|---|---|---|
| None | 224 (0.275) | 590 (0.725) | 814 (0.814) |
| banks | 57 (0.410) | 82 (0.590) | 139 (0.139) |
| stores | 19 (0.404) | 28 (0.596) | 47 (0.047) |
| Column Total | 300 | 700 | 1000 |
Housing vs Credit_Risk, N (N / Row Total):
| Housing | Bad | Good | Row Total |
|---|---|---|---|
| Free | 44 (0.407) | 64 (0.593) | 108 (0.108) |
| Rent | 70 (0.391) | 109 (0.609) | 179 (0.179) |
| Own | 186 (0.261) | 527 (0.739) | 713 (0.713) |
| Column Total | 300 | 700 | 1000 |
Number_of_existing_credits_at_this_bank vs Credit_Risk, N (N / Row Total):
| Number_of_existing_credits_at_this_bank | Bad | Good | Row Total |
|---|---|---|---|
| 1 | 200 (0.316) | 433 (0.684) | 633 (0.633) |
| 2 | 92 (0.276) | 241 (0.724) | 333 (0.333) |
| 3 | 6 (0.214) | 22 (0.786) | 28 (0.028) |
| 4 | 2 (0.333) | 4 (0.667) | 6 (0.006) |
| Column Total | 300 | 700 | 1000 |
Job vs Credit_Risk, N (N / Row Total):
| Job | Bad | Good | Row Total |
|---|---|---|---|
| Unemployed_NonRes | 7 (0.318) | 15 (0.682) | 22 (0.022) |
| Unskilled_Res | 56 (0.280) | 144 (0.720) | 200 (0.200) |
| skilled | 186 (0.295) | 444 (0.705) | 630 (0.630) |
| management | 51 (0.345) | 97 (0.655) | 148 (0.148) |
| Column Total | 300 | 700 | 1000 |
Number_of_dependants vs Credit_Risk, N (N / Row Total):
| Number_of_dependants | Bad | Good | Row Total |
|---|---|---|---|
| lt_2 | 46 (0.297) | 109 (0.703) | 155 (0.155) |
| gt_2 | 254 (0.301) | 591 (0.699) | 845 (0.845) |
| Column Total | 300 | 700 | 1000 |
Telephone vs Credit_Risk, N (N / Row Total):
| Telephone | Bad | Good | Row Total |
|---|---|---|---|
| No | 187 (0.314) | 409 (0.686) | 596 (0.596) |
| Yes | 113 (0.280) | 291 (0.720) | 404 (0.404) |
| Column Total | 300 | 700 | 1000 |
foreign_worker vs Credit_Risk, N (N / Row Total):
| foreign_worker | Bad | Good | Row Total |
|---|---|---|---|
| No | 4 (0.108) | 33 (0.892) | 37 (0.037) |
| Yes | 296 (0.307) | 667 (0.693) | 963 (0.963) |
| Column Total | 300 | 700 | 1000 |
Credit_Risk vs Credit_Risk, N (N / Row Total):
| Credit_Risk | Bad | Good | Row Total |
|---|---|---|---|
| Bad | 300 (1.000) | 0 (0.000) | 300 (0.300) |
| Good | 0 (0.000) | 700 (1.000) | 700 (0.700) |
| Column Total | 300 | 700 | 1000 |
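To make the N / Row Total figures concrete, here is a small sketch that rebuilds the foreign_worker rows from the counts reported above and recomputes the row proportions. It is written in Python for illustration and is not the code that produced the tables. The `cross_table` helper is a hypothetical name.

```python
from collections import Counter


def cross_table(pairs):
    """Return {category: {outcome: (n, n / row_total)}} for (category, outcome) pairs."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (category, _outcome), n in counts.items():
        row_totals[category] += n
    table = {}
    for (category, outcome), n in counts.items():
        table.setdefault(category, {})[outcome] = (n, n / row_totals[category])
    return table


# One (foreign_worker, Credit_Risk) pair per observation, rebuilt from the counts.
pairs = (
    [("No", "Bad")] * 4
    + [("No", "Good")] * 33
    + [("Yes", "Bad")] * 296
    + [("Yes", "Good")] * 667
)

table = cross_table(pairs)
for category, row in table.items():
    for outcome, (n, share) in row.items():
        print(f"foreign_worker={category:<3} {outcome:<4} N={n:>3}  N/RowTotal={share:.3f}")
```

For foreign workers this gives 296 / 963 for “bad” and 667 / 963 for “good”, which round to the 0.307 and 0.693 quoted earlier.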
• Chi-sq test of independence: tests whether two categorical variables are independent, that is, whether there is a significant association between their categories. A p-value below 0.05 (the significance threshold) implies that the two variables are significantly associated with each other. The p-value for each variable tested against Credit_Risk is listed below.
| Variable | p-value |
|---|---|
| Status_checking_account | 0.0000000 |
| Credit_history | 0.0000000 |
| Purpose | 0.0001157 |
| Savings_account_bonds | 0.0000003 |
| Present_employment_since | 0.0010455 |
| Installment_rate_in_percentage_of_disp_income | 0.1400333 |
| Personal_status_and_sex | 0.0222380 |
| Guarantors | 0.0360560 |
| Present_residence_since | 0.8615521 |
| Property | 0.0000286 |
| Other_installment_plans | 0.0016293 |
| Housing | 0.0001117 |
| Number_of_existing_credits_at_this_bank | 0.4451441 |
| Job | 0.5965816 |
| Number_of_dependants | 0.9240463 |
| Telephone | 0.2488438 |
| foreign_worker | 0.0094431 |
| Credit_Risk | 0.0000000 |
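For readers who want to see where such a p-value comes from, the sketch below applies the Pearson chi-square test to the Housing counts from the frequency table above. It is an illustration in Python using only the standard library, not the code behind the table. The function names `chi2_sf` and `chi2_independence` are hypothetical. No continuity correction is applied, an assumption that appears consistent with the reported values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_independence(observed):
    """Pearson chi-square test of independence, without continuity correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Bad/Good counts per Housing level (Free, Rent, Own).
housing = [[44, 64], [70, 109], [186, 527]]
stat, df, p_value = chi2_independence(housing)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.7f}")
```

The resulting p-value should land close to the 0.0001117 listed for Housing. By the 0.05 threshold, the variables in the table with larger p-values (Installment_rate_in_percentage_of_disp_income, Present_residence_since, Number_of_existing_credits_at_this_bank, Job, Number_of_dependants and Telephone) show no significant association with Credit_Risk.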
Let’s understand the data from the plots.
• Barplots for categorical data, showing the frequencies color-coded by the outcome variable (Credit_Risk)
• Boxplots for numeric data, showing the distributions color-coded by the outcome variable (Credit_Risk)